Bypass the systemd service restart limit and do immediately restart when change to local mode #15432

lixiaoyuner · 2023-06-12T07:45:33Z

Why I did it

During an upgrade via k8s, the feature's systemd service restarts as well. Every feature systemd service has a restart limit, and the limit is small: only three starts. If a fallback happens during an upgrade, the start count is already 2, so one more restart leaves the systemd service down. We therefore need to bypass this limit. The restart function is called on every local -> kube, kube -> kube, and kube -> local transition, and each call must restart the service successfully, so we run reset-failed every time we restart.
When going back to local mode, we restart the systemd service immediately instead of waiting for the default restart interval, which reduces container downtime.

Work item tracking

Microsoft ADO (number only):
24172368

How I did it

Before every restart for an upgrade, reset the feature's restart count. The count is reset to 0, which bypasses the restart limit.
When going back to local mode, restart the systemd service immediately.
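
The two steps above can be sketched as follows. This is a minimal illustration, not the code from this PR: the function name restart_feature_service, the runner parameter, and the "<feature>.service" unit naming are assumptions made for the example. It also assumes that for kube owners the service's own restart policy brings it back up, so only the local case issues an explicit restart.

```python
import subprocess
from typing import Callable, List, Sequence

# A runner takes an argv list and returns an exit status (hypothetical type alias).
Runner = Callable[[Sequence[str]], int]


def restart_feature_service(feature: str, owner: str,
                            runner: Runner = subprocess.call) -> List[List[str]]:
    """Reset the unit's start-rate counter, then restart it if the owner is local.

    Returns the commands that were issued, in order.
    """
    unit = f"{feature}.service"  # assumed unit naming, for illustration only

    # Always clear the start-rate counter first so the next start is not refused.
    commands = [["systemctl", "reset-failed", unit]]

    if owner == "local":
        # Going back to local mode: restart right away instead of waiting
        # for the default restart interval.
        commands.append(["systemctl", "restart", unit])

    for cmd in commands:
        runner(cmd)
    return commands
```

Passing a fake runner makes the sketch usable in unit tests without touching systemd on the test machine.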

How to verify it

The feature's systemd service can always be restarted successfully during an upgrade via k8s.

Which release branch to backport (provide reason below if selected)

Tested branch (Please provide the tested image version)

20220531.28

Description for the changelog

Link to config_db schema for YANG module changes

A picture of a cute animal (not mandatory but encouraged)

rules/config

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py

src/sonic-ctrmgrd/ctrmgr/kube_commands.py

qiluo-msft · 2023-06-22T07:03:10Z

Your code changes cover more than what the PR title and description say. Could you update them?

lixiaoyuner · 2023-06-22T16:31:51Z

Your code changes cover more than what the PR title and description say. Could you update them?

Thanks for your comments. I have updated them; could you please go ahead and review?

lguohan · 2023-06-30T16:47:02Z

For the first PR, I think it should be a separate PR. I spent quite some time figuring out which line of code maps to this reset-failed. Even for this feature, we need to explain this statement: "It's easy to meet this limit when upgrade and fallback happen at the same time." Why? I couldn't figure it out.

lixiaoyuner · 2023-07-03T09:26:54Z

For the first PR, I think it should be a separate PR. I spent quite some time figuring out which line of code maps to this reset-failed. Even for this feature, we need to explain this statement: "It's easy to meet this limit when upgrade and fallback happen at the same time." Why? I couldn't figure it out.

Maybe the word "easy" caused the confusion. It is actually not that easy; it's an extreme case. Let me first explain how the systemd service restarts when k8s upgrades the container. Take v1(kube) --> v2(kube) as an example: k8s stops the v1 container first. At that point the systemd service is running "docker wait v1-id". Once the v1 container stops, "docker wait v1-id" returns an error code and the systemd service exits with an error code. Because of the restart policy, the systemd service restarts and the failure count increases by 1. But the failure limit is only 3 within 20 minutes, which means that if we upgrade or fall back three times within 20 minutes, the systemd service will never come up. As a possible example, if a fallback happens during an upgrade, the failure count will be 2, and one more restart brings the systemd service down. So we need to run reset-failed to reset the failure count before we restart the systemd service.

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py

lguohan · 2023-07-05T21:48:47Z

k8s stops the v1 container first. At that point the systemd service is running "docker wait v1-id". Once the v1 container stops, "docker wait v1-id" returns an error code and the systemd service exits with an error code.

In this case it is a planned stop, so why does docker wait return an error code? Can we treat this as a planned stop with error code = 0?

lixiaoyuner · 2023-07-06T02:27:04Z

In this case it is a planned stop, so why does docker wait return an error code? Can we treat this as a planned stop with error code = 0?

The reason is that k8s kills the container, so the docker wait result is non-zero. One option is to check whether the wait ID is the feature name. If it is the feature name, it's a local container; if not, it's a kube container. For a kube container, we could ignore the docker wait return code once docker wait returns and return 0 directly. This may be a solution. I can try implementing it to verify that it's feasible.

Latest reply
I did a quick test: after changing the exit code to 0, it still doesn't work. I thought systemd would use the failed-exit count, but it actually uses the service start count, regardless of whether the previous exit failed or succeeded. Once the service has started three times within 20 minutes, it fails to restart again. So we can only use the "systemctl reset-failed" command to bypass the limit.
Our service's configuration and the official systemd documentation on the limit are pasted below.

Reference:
Our systemd service configuration:
StartLimitIntervalSec=1200
StartLimitBurst=3
systemd service explanation:
Configure unit start rate limiting. Units which are started more than burst times within an interval time span are not permitted to start any more.
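
To illustrate why the exit code does not matter, here is a simplified model of start rate limiting with the values above. It is only a sketch for reasoning about the counter, not systemd's implementation; the class StartLimiter and its method names are hypothetical.

```python
class StartLimiter:
    """Toy model of a start rate limit: at most `burst` starts per window."""

    def __init__(self, interval_sec=1200, burst=3):
        self.interval_sec = interval_sec  # mirrors StartLimitIntervalSec
        self.burst = burst                # mirrors StartLimitBurst
        self._window_start = None
        self._count = 0

    def try_start(self, now):
        """Return True if a start at time `now` (seconds) is allowed.

        Every start is counted, whether the previous run exited
        successfully or with an error.
        """
        # No window yet, or the previous window has expired: open a new one.
        if self._window_start is None or now - self._window_start > self.interval_sec:
            self._window_start = now
            self._count = 1
            return True
        # Still inside the window: allow the start only while under the burst.
        if self._count < self.burst:
            self._count += 1
            return True
        return False

    def reset_failed(self):
        """Model the effect of reset-failed on the counter: start from zero."""
        self._window_start = None
        self._count = 0
```

In this model, three starts inside the 1200-second window are allowed and a fourth is refused until the counter is reset, which matches the behavior described in the comment above.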

…start when change to local mode (#15432) (#15839)

mssonicbld · 2023-07-17T15:25:03Z

Cherry-pick PR to 202305: #15868

mssonicbld · 2023-07-17T15:25:28Z

Cherry-pick PR to 202211: #15869

…start when change to local mode (#15432) (#15868)

Bypass the systemd service restart limit

97d777f

lixiaoyuner requested a review from lguohan as a code owner June 12, 2023 07:45

lixiaoyuner added 3 commits June 12, 2023 11:13

Use subprocess call func to replace run func

e298b07

No need to clean up image when it's unstable

de13784

Include k8s for test

7cfdf6b

lixiaoyuner requested review from qiluo-msft and xumia as code owners June 13, 2023 01:57

losha228 reviewed Jun 14, 2023

rules/config (outdated, resolved)

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py (outdated, resolved)

src/sonic-ctrmgrd/ctrmgr/kube_commands.py (outdated, resolved)

lixiaoyuner added 2 commits June 15, 2023 07:08

Add test cases

3b1fd7f

Disable k8s as default

78928c8

losha228 approved these changes Jun 19, 2023

lixiaoyuner changed the title ~~Bypass the systemd service restart limit~~ Bypass the systemd service restart limit and do immediately restart when change to local mode and only do image clean up when do tag latest Jun 22, 2023

Fix same version images have different ids issue

fb06c45

losha228 self-requested a review June 27, 2023 11:26

losha228 approved these changes Jun 27, 2023

qiluo-msft reviewed Jul 4, 2023

src/sonic-ctrmgrd/ctrmgr/ctrmgrd.py (outdated, resolved)

lixiaoyuner and others added 2 commits July 5, 2023 02:19

Fix typo

58ef09d

Merge branch 'master' into bypass-systemd-restart-limit

9b84fb3

lixiaoyuner added 2 commits July 7, 2023 15:57

Remove the image clean code to split to another PR

9032bcb

Merge branch 'bypass-systemd-restart-limit' of github.com:lixiaoyuner…

3c96b31

…/sonic-buildimage into bypass-systemd-restart-limit

lixiaoyuner changed the title ~~Bypass the systemd service restart limit and do immediately restart when change to local mode and only do image clean up when do tag latest~~ Bypass the systemd service restart limit and do immediately restart when change to local mode Jul 7, 2023

Check current owner before restart

ac728f9

lguohan approved these changes Jul 14, 2023

lguohan merged commit df13380 into sonic-net:master Jul 14, 2023
16 checks passed

lguohan added Request for 202205 Branch Request for 202305 Branch labels Jul 14, 2023

lixiaoyuner added the Request for 202211 Branch label Jul 14, 2023

yxieca added the Included in 202205 Branch label Jul 14, 2023

StormLiangMS added the Approved for 202305 Branch label Jul 17, 2023

mssonicbld added the Created PR to 202305 Branch label Jul 17, 2023

StormLiangMS added Approved for 202211 Branch and removed Created PR to 202305 Branch labels Jul 17, 2023

mssonicbld mentioned this pull request Jul 17, 2023

[action] [PR:15432] Bypass the systemd service restart limit and do immediately restart when change to local mode #15868

Merged

10 tasks

StormLiangMS removed the Approved for 202211 Branch label Jul 17, 2023

mssonicbld added the Created PR to 202211 Branch label Jul 17, 2023

mssonicbld mentioned this pull request Jul 17, 2023

[action] [PR:15432] Bypass the systemd service restart limit and do immediately restart when change to local mode #15869

Merged

10 tasks

mssonicbld added a commit that referenced this pull request Jul 19, 2023

[k8s]: Bypass the systemd service restart limit and do immediately re…

f4a7e22

…start when change to local mode (#15432) (#15868)

mssonicbld added the Included in 202305 Branch label Jul 19, 2023

mssonicbld added Included in 202211 Branch and removed Created PR to 202211 Branch labels Jul 20, 2023
